Entropic selection of concepts in networks of similarity between documents
Abstract
Scientists have devoted considerable effort to studying the organization and evolution of science by leveraging the textual information contained in the titles and abstracts of scientific documents. However, only a few studies analyze the whole body of a document. Using the full text instead makes it possible to unveil the organization of scientific knowledge through a network of similarity between articles based on their characterizing concepts, which can be extracted, for instance, through the ScienceWISE platform. However, such a network has a remarkably high link density (36%), which hinders the association of groups of documents with a given topic, because not all concepts are equally informative and useful for discriminating between articles. The presence of “generic concepts” generates a large number of spurious connections in the system. To identify and remove these concepts, we introduce a method that gauges their relevance using an information-theoretic approach. The significance of a concept c is encoded by the distance between its maximum entropy, S_max, and the observed one, S_c. After removing concepts within a certain distance from the maximum, we rebuild the similarity network and analyze its topic structure. Pruning concepts has two consequences: the number of links decreases, and so does the noise in the strength of similarities between articles. Hence, the filtered network displays a more refined community structure, in which each community contains articles related to a specific topic. Finally, the method can be applied to other kinds of documents and also works in a coarse-grained mode, allowing a corpus to be studied at different scales.
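The filtering idea in the abstract can be illustrated with a minimal sketch. It rests on simplifying assumptions that are not taken from the paper: S_c is computed here as the plain Shannon entropy of a concept's occurrence counts across documents, S_max as the entropy of a uniform spread over all documents (log N), and similarity as cosine similarity between concept-count vectors. The function names, the threshold value, and the toy corpus are hypothetical.

```python
import math
from itertools import combinations


def concept_entropy(counts):
    """Shannon entropy (in nats) of a concept's occurrence counts over documents."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def filter_concepts(doc_counts, threshold):
    """Keep concepts whose entropy lies more than `threshold` below the maximum.

    Assumption: S_max = log(N), the entropy of a concept spread evenly over
    all N documents. The paper's own definition of S_max may differ.
    """
    s_max = math.log(len(doc_counts))
    concepts = {c for doc in doc_counts for c in doc}
    kept = set()
    for c in concepts:
        s_c = concept_entropy([doc.get(c, 0) for doc in doc_counts])
        if s_max - s_c > threshold:
            kept.add(c)
    return kept


def cosine(a, b):
    """Cosine similarity between two sparse concept-count vectors (dicts)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_network(doc_counts, kept):
    """Weighted edges between documents that share at least one kept concept."""
    filtered = [{c: n for c, n in doc.items() if c in kept} for doc in doc_counts]
    edges = {}
    for i, j in combinations(range(len(filtered)), 2):
        weight = cosine(filtered[i], filtered[j])
        if weight > 0:
            edges[(i, j)] = weight
    return edges


# Toy corpus (hypothetical): "energy" appears evenly in every document.
docs = [
    {"energy": 3, "graphene": 5},
    {"energy": 3, "graphene": 4},
    {"energy": 3, "galaxy": 6},
    {"energy": 3, "galaxy": 2},
]
kept = filter_concepts(docs, threshold=0.1)
before = similarity_network(docs, {"energy", "graphene", "galaxy"})
after = similarity_network(docs, kept)
print(sorted(kept), len(before), sorted(after))
```

In this toy corpus the evenly spread concept "energy" sits at the maximum entropy and is pruned, so the fully connected network reduces to two links, one per topic. Raising the threshold removes more concepts, which is one possible reading of the coarse-grained mode mentioned in the abstract.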
Similar resources
Document clustering based on ontology and a fuzzy approach
Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, one of the important techniques of text mining, is the unsupervised classification of similar documents into different groups. The most important step...
Resilient Supplier Selection in a Supply Chain by a New Interval-Valued Fuzzy Group Decision Model Based on Possibilistic Statistical Concepts
Supplier selection is one of the main concerns in supply chain networks, given their global and competitive features. Resilient supplier selection, a relatively new idea, has not been addressed properly in the literature under uncertain conditions. Therefore, in this paper, a new multi-criteria group decision-making (MCGDM) model is introduced with interval-valued fuzzy sets (I...
Similarity measurement for describe user images in social media
Online social networks like Instagram are places for communication. These media also produce rich metadata that are useful for further analysis in many fields, including health and cognitive science. Many researchers use these metadata, such as hashtags and images, to detect patterns of user activities. However, there are several serious ambiguities like how much reliable are these informa...
A procedure for Web Service Selection Using WS-Policy Semantic Matching
In general, policy-based approaches play an important role in the management of web services, for instance in the choice of semantic web services and of quality of service (QoS) in particular. The present research work illustrates a procedure for selecting a web service among functionally similar web services based on WS-Policy semantic matching. In this study, the procedure of WS-Policy publi...
Metaheuristic clustering of Persian XML documents based on structural and content similarity
Due to the increasing number of XML documents, organizing them effectively in order to retrieve useful information is essential. A possible solution is to cluster XML documents in order to discover knowledge. A key issue in clustering XML documents is how to measure the similarity between them. Conventional clustering of text documents using a do...
Journal title:
- CoRR
Volume abs/1705.06510, Issue -
Pages -
Publication date 2017